Conversation

@jhamman
Member

@jhamman jhamman commented Oct 20, 2025

Summary

Adds support for RectilinearChunkGrid extension (Zarr v3), enabling arrays with variable chunk sizes per dimension.

Closes: #1595 | Replaces: #1483 | Related: zarr-extensions#25

Key Features

RectilinearChunkGrid

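import zarr

arr = zarr.create_array(
    store=zarr.storage.MemoryStore(),  # in-memory store, for illustration
    shape=(60, 100),
    dtype="uint8",
    chunks=[[10, 20, 30], [25, 25, 25, 25]],  # per-dimension chunk sizes
    zarr_format=3,
)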
  • Zarr v3 only (not compatible with v2, sharding, or from_array())
  • Supports RLE in JSON metadata: [[10, 6]] = 6 chunks of size 10
  • Stored internally in expanded format for fast indexing
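
A minimal sketch of the RLE round trip described above, not the PR's actual implementation. The names expand_rle and compress_rle are hypothetical, and the sketch assumes each entry of a dimension's spec is either a bare integer (one chunk) or a [size, count] pair.

```python
from itertools import groupby


def expand_rle(spec):
    """Expand one dimension's chunk spec into an explicit list of sizes.

    Assumption: each entry is a bare int (one chunk) or a [size, count] pair.
    """
    sizes = []
    for entry in spec:
        if isinstance(entry, int):
            sizes.append(entry)
        else:
            size, count = entry
            sizes.extend([size] * count)
    return sizes


def compress_rle(sizes):
    """Collapse runs of equal sizes into [size, count] pairs; keep singletons as ints."""
    out = []
    for size, run in groupby(sizes):
        count = sum(1 for _ in run)
        out.append(size if count == 1 else [size, count])
    return out


print(expand_rle([[10, 6]]))            # [10, 10, 10, 10, 10, 10]
print(compress_rle([10, 10, 10, 25]))   # [[10, 3], 25]
```

Keeping the expanded form in memory means an indexer can work on a flat list of sizes, while the compressed form keeps the JSON metadata small for long runs of equal chunks.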

Chunk Grid Access

from zarr.core.chunk_grids import RectilinearChunkGrid

grid = arr.chunk_grid                    # Returns ChunkGrid instance
grid.chunk_shapes                        # ((10, 20, 30), (25, 25, 25, 25))
isinstance(grid, RectilinearChunkGrid)   # Type-safe checking

.chunks Property Behavior

# RegularChunkGrid: returns tuple with FutureWarning (deprecated)
arr.chunks  # (10, 10)

# RectilinearChunkGrid: raises NotImplementedError
arr.chunks  # Use .chunk_grid instead

Design Decisions

| Decision | Rationale |
| --- | --- |
| ChunksLike as TypeAlias | Flexible input types without runtime overhead |
| ResolvedChunkSpec as frozen dataclass | Named access, immutability, IDE support |
| Standalone validation functions | Testability, clear error messages, early validation |
| .chunks raises for rectilinear | No sensible single-tuple representation; guides users to .chunk_grid |
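
To illustrate the "frozen dataclass plus standalone validation" pattern, here is a rough sketch under stated assumptions. ChunkSpecSketch and validate_chunk_shapes are hypothetical names, not the PR's ResolvedChunkSpec or its validators, and the rule that chunk sizes must sum exactly to the array extent is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChunkSpecSketch:
    """Hypothetical stand-in for a resolved chunk spec (not the PR's class)."""

    chunk_shapes: Tuple[Tuple[int, ...], ...]  # one tuple of sizes per dimension
    shards: Optional[Tuple[int, ...]] = None


def validate_chunk_shapes(shape, chunk_shapes):
    """Check per-dimension chunk sizes against an array shape and freeze them.

    Assumption: sizes must be positive and sum exactly to the extent.
    """
    if len(chunk_shapes) != len(shape):
        raise ValueError(
            f"expected {len(shape)} chunk lists, got {len(chunk_shapes)}"
        )
    for axis, (extent, sizes) in enumerate(zip(shape, chunk_shapes)):
        if any(size <= 0 for size in sizes):
            raise ValueError(f"axis {axis}: chunk sizes must be positive")
        if sum(sizes) != extent:
            raise ValueError(
                f"axis {axis}: chunk sizes sum to {sum(sizes)}, expected {extent}"
            )
    return ChunkSpecSketch(chunk_shapes=tuple(tuple(s) for s in chunk_shapes))


spec = validate_chunk_shapes((60, 100), [[10, 20, 30], [25, 25, 25, 25]])
print(spec.chunk_shapes)  # ((10, 20, 30), (25, 25, 25, 25))
```

Because the validator is a plain function, it can be unit tested without constructing an array, and the frozen result cannot be mutated after validation.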

Removed from Earlier Designs

| Item | Reason |
| --- | --- |
| RegularChunks/RectilinearChunks tuple subclasses | Rejected - unnecessary complexity |
| Named dimension access (chunks.lat) | Removed per review feedback |
| ChunksType ABC hierarchy | Not implemented - TypeAlias approach preferred |

Deferred / TODO Items

| Item | Location | Notes |
| --- | --- | --- |
| update_shape() optional chunks parameter | metadata/v3.py:483 | Allow specifying new chunk sizes when resizing instead of default heuristic |
| Validation function placement | chunk_grids.py:1513-1593 | Reviewer suggested moving to metadata module; kept for testability |

Review Focus Areas

High Priority:

  • chunk_grids.py: RectilinearChunkGrid class, ChunksLike type, RLE expansion/compression, resolve_chunk_spec()
  • metadata/v3.py: update_shape() for rectilinear resize behavior
  • indexing.py: Variable chunk indexing with binary search
  • array.py: .chunks property behavior, .chunk_grid property
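
For reviewers of the indexing changes, the idea behind binary-search chunk lookup can be sketched as follows. This is an illustration using the standard library, not the code in indexing.py; chunk_offsets and locate are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate


def chunk_offsets(sizes):
    """Start offset of every chunk, plus the total extent as a final entry."""
    return [0, *accumulate(sizes)]


def locate(index, offsets):
    """Return (chunk_number, offset_within_chunk) for an index along one axis."""
    if not 0 <= index < offsets[-1]:
        raise IndexError(f"index {index} out of bounds for extent {offsets[-1]}")
    # The last offset that is <= index marks the chunk containing it.
    chunk = bisect_right(offsets, index) - 1
    return chunk, index - offsets[chunk]


offsets = chunk_offsets([10, 20, 30])  # [0, 10, 30, 60]
print(locate(0, offsets))    # (0, 0)
print(locate(10, offsets))   # (1, 0)
print(locate(59, offsets))   # (2, 29)
```

With a regular grid the chunk number is just index // chunk_size; with variable sizes, precomputing cumulative offsets keeps each lookup at O(log n) in the number of chunks along the axis.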

Tests:

  • test_chunk_grids/test_rectilinear.py: Comprehensive unit tests
  • test_chunk_grids/test_rectilinear_integration.py: End-to-end scenarios
  • testing/strategies.py: Hypothesis strategies for property-based testing

Breaking Changes

None. Fully backward compatible.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes label (automatically applied to PRs which haven't added release notes) Oct 20, 2025
@codecov

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 78.85533% with 133 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.65%. Comparing base (b712f96) to head (41db2dc).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/zarr/core/chunk_grids.py | 74.49% | 76 Missing ⚠️ |
| src/zarr/core/array.py | 75.00% | 20 Missing ⚠️ |
| src/zarr/core/indexing.py | 85.07% | 20 Missing ⚠️ |
| src/zarr/testing/strategies.py | 89.28% | 9 Missing ⚠️ |
| src/zarr/core/metadata/v2.py | 72.72% | 6 Missing ⚠️ |
| src/zarr/core/_info.py | 0.00% | 1 Missing ⚠️ |
| src/zarr/core/metadata/v3.py | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   60.94%   61.65%   +0.71%     
==========================================
  Files          86       86              
  Lines       10268    10769     +501     
==========================================
+ Hits         6258     6640     +382     
- Misses       4010     4129     +119     

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/zarr/api/asynchronous.py | 72.20% <ø> (ø) |
| src/zarr/api/synchronous.py | 36.61% <ø> (ø) |
| src/zarr/core/group.py | 70.27% <ø> (ø) |
| src/zarr/core/_info.py | 51.80% <0.00%> (ø) |
| src/zarr/core/metadata/v3.py | 59.91% <90.00%> (+1.90%) ⬆️ |
| src/zarr/core/metadata/v2.py | 60.31% <72.72%> (+2.17%) ⬆️ |
| src/zarr/testing/strategies.py | 94.18% <89.28%> (-3.66%) ⬇️ |
| src/zarr/core/array.py | 67.99% <75.00%> (-0.12%) ⬇️ |
| src/zarr/core/indexing.py | 70.19% <85.07%> (+0.73%) ⬆️ |
| src/zarr/core/chunk_grids.py | 70.70% <74.49%> (+8.40%) ⬆️ |

... and 1 file with indirect coverage changes

@github-actions github-actions bot removed the needs release notes label (automatically applied to PRs which haven't added release notes) Oct 20, 2025


@dataclass(frozen=True)
class RectilinearChunkGrid(ChunkGrid):
Contributor

thoughts on just calling this class Rectilinear, and renaming the RegularChunkGrid to Regular? We could keep around a RegularChunkGrid class for compatibility. But I feel like people know these are chunk grids when they import them

Member Author

50/50. I think the more descriptive class name is useful when looking at tracebacks. Plus, this is currently in .core, so it's not meant to be used directly by users.

@given(data=st.data())
async def test_basic_indexing(data: st.DataObject) -> None:
zarray = data.draw(simple_arrays())
@given(data=st.data(), zarray=st.one_of([simple_arrays(), complex_chunked_arrays()]))
Contributor

@dcherian dcherian Oct 27, 2025

Because the search space for the standard arrays strategy is so large, I made a separate strategy, complex_chunked_arrays, that purely checks different chunk grids.
With simple_arrays() we spend only 10% of our time trying RectilinearChunkGrid, hence this approach. We should boost the number of examples too.

Comment on lines +668 to +669
2. **Not compatible with sharding**: You cannot use variable chunking together with
the sharding feature. Arrays must use either variable chunking or sharding, but not both.
Contributor

I hope this is a temporary limitation! There's a natural extension of rectilinear chunk grids to rectilinear shard grids.

Member Author

@keewis
Contributor

keewis commented Jan 8, 2026

is it intentional that the chunk grid is RectilinearChunkGrid when explicitly enumerating regular chunk sizes?

import zarr

store = zarr.storage.MemoryStore()
arr = zarr.create_array(
    store,
    shape=(60, 100),
    chunks=[[20, 20, 20], [25, 25, 25, 25]],
    zarr_format=3,
    dtype="uint8"
)
arr.metadata.chunk_grid # RectilinearChunkGrid

Or is RegularChunkGrid only chosen if I specify a tuple of values, in this case (20, 25)?

Not sure how much this matters in practice (xarray's code does not appear to be affected by the difference), but that's what dask normalizes the chunks to.

@jhamman
Member Author

jhamman commented Jan 14, 2026

is it intentional that the chunk grid is RectilinearChunkGrid when explicitly enumerating regular chunk sizes?

@keewis - yes, this is intentional. We decided this was preferable because it will allow the user to add variable lengths later (through an extend/append workflow).
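
For downstream code that wants the regular (dask-style) view of such a grid, a check like the following could be used. This is only a sketch: as_regular_chunks is a hypothetical helper, not part of zarr, and it assumes non-empty size tuples and treats a smaller final chunk as a partial edge chunk.

```python
def as_regular_chunks(chunk_shapes):
    """Return one chunk size per dimension if every dimension is uniform, else None.

    Assumption: a dimension counts as uniform when all chunks share a size,
    except that the final chunk may be smaller (a partial edge chunk).
    """
    regular = []
    for sizes in chunk_shapes:
        first = sizes[0]
        if any(size != first for size in sizes[:-1]) or sizes[-1] > first:
            return None
        regular.append(first)
    return tuple(regular)


print(as_regular_chunks(((20, 20, 20), (25, 25, 25, 25))))  # (20, 25)
print(as_regular_chunks(((10, 20, 30), (25, 25, 25, 25))))  # None
```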

@codspeed-hq

codspeed-hq bot commented Jan 16, 2026

CodSpeed Performance Report

Merging this PR will not alter performance

Comparing jhamman:feature/rectilinear-chunk-grid (665847a) with main (e94504e)

Summary

✅ 30 untouched benchmarks



@dataclass(frozen=True)
class ResolvedChunkSpec:
Contributor

does this need to be a class? It has no methods. Seems like either a TypedDict or just a tuple would work

shards: tuple[int, ...] | None


def _validate_zarr_format_compatibility(
Contributor

why do we need this here? Can't we check when we construct the array metadata document?

)


def _validate_sharding_compatibility(
Contributor

shouldn't this be defined over in the array v3 metadata module, and used there to check that the array metadata document is valid?

)


def _validate_data_compatibility(
Contributor

Does this need to be a stand-alone function? Are we going to use this logic anywhere other than its current usage site (resolve_chunk_spec)?

@@ -436,6 +468,21 @@ def to_dict(self) -> dict[str, JSON]:
return out_dict

def update_shape(self, shape: tuple[int, ...]) -> Self:
Contributor

IMO this needs to take a parameter that can define the new chunks (which can default to None or some other sentinel flagging "default behavior" semantics)
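
One possible shape for that suggestion, sketched for a single axis. Everything here is hypothetical: resize_chunk_sizes is not the PR's update_shape(), and the default used when new_chunks is None (append one chunk when growing, trim trailing chunks when shrinking) is a placeholder, not the PR's heuristic.

```python
def resize_chunk_sizes(sizes, new_extent, new_chunks=None):
    """Return chunk sizes for one axis after resizing it to new_extent.

    new_chunks=None selects a placeholder default: when growing, add one
    chunk covering the new region; when shrinking, drop or trim trailing chunks.
    """
    if new_chunks is not None:
        # Caller-supplied chunking must cover the new extent exactly.
        if sum(new_chunks) != new_extent:
            raise ValueError(
                f"new chunks sum to {sum(new_chunks)}, expected {new_extent}"
            )
        return tuple(new_chunks)

    old_extent = sum(sizes)
    if new_extent >= old_extent:
        grown = new_extent - old_extent
        return tuple(sizes) + ((grown,) if grown else ())

    kept, covered = [], 0
    for size in sizes:
        if covered >= new_extent:
            break
        kept.append(min(size, new_extent - covered))
        covered += kept[-1]
    return tuple(kept)


print(resize_chunk_sizes((10, 20, 30), 75))            # (10, 20, 30, 15)
print(resize_chunk_sizes((10, 20, 30), 25))            # (10, 15)
print(resize_chunk_sizes((10, 20, 30), 80, (40, 40)))  # (40, 40)
```

Using None as the sentinel keeps existing call sites unchanged while letting callers opt in to explicit chunk sizes.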

@d-v-b d-v-b mentioned this pull request Jan 28, 2026
@tomwhite
Member

FYI I tested this PR for implementing rechunking with variable-sized intermediate chunks in Cubed - and it worked!

I found one wrinkle in that zarr.create_array supports rectilinear chunk grids, but zarr.open (with mode="w") doesn't. But that could be addressed later.

@jhamman jhamman closed this Feb 3, 2026
@jhamman jhamman reopened this Feb 3, 2026
@d-v-b
Contributor

d-v-b commented Feb 4, 2026

I will give this a spin today, but assuming everything works, my question is: how can we ship this soon (this week?) while making it clear that the feature is experimental?

@tinaok

tinaok commented Feb 4, 2026

I have a variably chunked array:

import numpy as np
import xarray as xr
from healpix_geo import nested

da = xr.open_zarr(
    "https://data-taos.ifremer.fr/GRID4EARTH/no_chunk_healpix.zarr",
    consolidated=True,  # if metadata is consolidated
).da

depth = da.cell_ids.attrs["level"]
new_depth = depth - 6
parents = nested.zoom_to(da.cell_ids, depth=depth, new_depth=new_depth)
_, chunk_sizes = np.unique(parents, return_counts=True)
da.chunk({"cell_ids": tuple(chunk_sizes.tolist())})

Happy to experiment with writing it using this experimental feature (if it works with xarray...).
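
The chunk-size computation in the snippet above can be illustrated without the data set. This pure-Python sketch uses a hypothetical helper, chunk_sizes_from_labels, and made-up parent labels; it matches np.unique(..., return_counts=True) only under the assumption that the labels are already sorted, so that equal labels are contiguous.

```python
from itertools import groupby


def chunk_sizes_from_labels(labels):
    """Size of each run of equal consecutive labels (assumes labels are grouped)."""
    return tuple(sum(1 for _ in run) for _, run in groupby(labels))


# Made-up parent cell ids: one variable-sized chunk per parent cell.
parents = [0, 0, 0, 1, 1, 4, 4, 4, 4]
sizes = chunk_sizes_from_labels(parents)
print(sizes)  # (3, 2, 4)
```

The resulting tuple has one entry per parent cell and sums to the number of child cells, which is the form a per-dimension variable chunk specification expects.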

7 participants